All-fiber high-speed image detection enabled by deep learning

Ultra-high-speed imaging serves as a foundation for modern science. In biomedicine, optical-fiber-based endoscopy is often required for in vivo applications, but combining high speed with fiber endoscopy, which is vital for exploring transient biomedical phenomena, still faces challenges. We propose all-fiber high-speed imaging based on the transformation of two-dimensional spatial information into one-dimensional temporal pulse streams, achieved by leveraging the high intermodal dispersion in a multimode fiber. Neural networks are trained to reconstruct images from the temporal waveforms. The system not only detects images of the same kind as the training data with high quality, but also detects images of kinds different from the training images with slightly reduced quality. The fiber probe can detect micron-scale objects at a high frame rate (15.4 Mfps) and with a large frame depth (10,000). This scheme combines high speed with high mechanical flexibility and integration, and may stimulate future research exploring various phenomena in vivo.

Considering that the fastest and slowest modes have a group delay difference of around 50 ns, we can predict that after transmitting through the 1-km MMF, a pulse will split into a train of isolated sub-pulses spanning a temporal range of about 50 ns due to the intermodal dispersion.
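For illustration, this delay spread can be estimated from the ray picture of a step-index fiber, Δt ≈ L·NA²/(2·n₁·c). The following minimal Python sketch assumes representative values for the core index n₁ and the numerical aperture NA, neither of which is stated in this excerpt:

    # Back-of-the-envelope estimate of the intermodal delay spread of a
    # step-index MMF. The fiber parameters below are illustrative
    # assumptions; the text states only the 1-km length and ~50-ns spread.
    C = 3.0e8    # speed of light in vacuum (m/s)
    L = 1000.0   # fiber length (m)
    n1 = 1.45    # assumed core refractive index
    NA = 0.22    # assumed numerical aperture

    # Delay between the fastest (axial) and slowest (steepest) guided rays
    delta_t = L * NA**2 / (2 * n1 * C)
    print(f"predicted modal delay spread: {delta_t * 1e9:.1f} ns")  # ~55.6 ns

With these assumed parameters, the ray-optics estimate lands close to the 50 ns quoted above, which is the level of agreement such a back-of-the-envelope calculation can be expected to give.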

Supplementary Note 2: Fiber Probe
The cross-sectional refractive index distribution of the fiber probe is shown in Supplementary Figure 3; the probe consists of three claddings. The first is a fluorine-doped layer with a relatively low index, which confines the signal light to the core. The combination of the second silica cladding and the third low-index coating layer allows the illumination light to be guided in the second cladding layer.
Supplementary Figure 3. Cross-sectional refractive index distribution of the fiber probe.

Supplementary Note 3: Waveforms
Here we discuss some features of the waveforms. When the ten waveforms from Fig. 3 of the main text are plotted together in Supplementary Figure 4, all the sub-pulse peaks overlap in time, which proves that modal dispersion dominates the temporal evolution of the pulses, so that the temporal delays of the sub-pulses are determined only by the group delays of the corresponding modes. We also observe from Supplementary Figure 4 that the burst of sub-pulses covers a time range of around 45 ns, a little less than the 50 ns predicted in Supplementary Note 1. This may be because certain highest-order modes are harder to excite than expected, or because the parameters of the real fiber deviate slightly from the ideal values. Besides, each waveform contains about 40 sub-pulse peaks, which is much less than the number of LP modes in the MMF (136, see the calculation in Supplementary Note 1). This may be because some adjacent modes have very close group delays, so the light energy in these modes is not completely separated in the time domain; instead, it broadens each sub-pulse, as shown in the figure, and this broadening may itself carry information through the variation of the sub-pulse shapes. When we reduce the length of the MMF, the waveforms become shorter due to the reduced modal dispersion, as shown in Supplementary Figure 5. The performance of the system with different fiber lengths was tested, and some recovered results are shown in Supplementary Figure 6. We see that the images can be restored with high quality until the length is reduced to 400 m, which corresponds to a pulse spreading of approximately 18.7 ns.

Supplementary Note 5: Image Classification
We tried two different network structures for the classification of the images of hand-written digits. The first is a CNN, as shown in Supplementary Figure 9b; the waveforms are reshaped into 64×64 matrices as the input, and the categories 0-9 are the output. The second is a combination of the U-net (Supplementary Figure 9a) and the same CNN. The CNN model is pretrained with 60,000 different images of hand-written digits so that it can act as a digit classifier, which is then used to directly classify the images recovered by the U-net. We used 20,000 image/waveform pairs, comprising 17,000 training data, 2,000 validation data, and 1,000 testing data, to train and test the U-net model. The accuracies of the two structures, CNN and U-net + CNN, are tested to be 91.5% and 82.0%, respectively, as shown in Supplementary Figure 10. The results show that directly classifying the waveforms with the CNN yields a higher accuracy than classifying the recovered images, which is consistent with previous research [3].

Supplementary Note 6: System Robustness
To study the temperature effect, we first collected 10,000 image/waveform pairs with the indoor temperature fixed at approximately 25 ℃. These sample data were then used to train the neural network; here, we still used the digit images for training and testing. Next, we varied the environmental temperature from 23 ℃ to 27 ℃ by adjusting the air conditioners and collected the waveforms of new images at different temperatures, with 500 sample data every 0.5 ℃. In total, 4,000 image/waveform pairs were collected and used to test the neural network. The recovery performance at different temperatures is shown by the blue line in Supplementary Figure 11a.
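As an illustration of how such a temperature sweep might be scored, the sketch below groups the test pairs into 0.5 ℃ bins and averages a fidelity metric per bin. The model.predict interface and the use of Pearson correlation as the fidelity measure are assumptions for illustration only; the excerpt does not define either.

    import numpy as np

    def fidelity(recovered, truth):
        # Illustrative fidelity: Pearson correlation of pixel values.
        return float(np.corrcoef(recovered.ravel(), truth.ravel())[0, 1])

    def evaluate_by_temperature(model, waveforms, images, temps, step=0.5):
        # Average recovery fidelity per temperature bin (e.g., 23.0-27.0 C).
        results = {}
        for t in np.arange(temps.min(), temps.max() + step / 2, step):
            mask = np.abs(temps - t) < step / 2
            if not mask.any():
                continue
            recovered = model.predict(waveforms[mask])  # hypothetical API
            scores = [fidelity(r, g) for r, g in zip(recovered, images[mask])]
            results[round(float(t), 1)] = float(np.mean(scores))
        return results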
In addition, we verified that this sensitivity can be suppressed by joint training. Joint training uses data collected at different temperatures, instead of at a fixed temperature, to train the neural network; in this way, the trained model learns to adapt to imaging at different temperatures. To demonstrate this, we again collected 10,000 training data with the temperature fixed at 25 ℃. Then, we adjusted the temperature continuously from 23 ℃ to 27 ℃ and collected another 8,000 waveforms of new images at different temperatures, with 1,000 data points for every 0.5 ℃ change. Among these 8,000 data, 4,000 were used for training and the other 4,000 for testing. Thus, a total of 14,000 data were used to co-train the neural network. Next, the trained model was used to recover the 4,000 test images from their waveforms. The results are shown by the red line in Supplementary Figure 11a. The imaging performance under temperature changes is largely improved: the average fidelity remains above 70% over the large range of 23.5 ℃ to 26.5 ℃, and this range is expected to expand further as more sample data collected at different temperatures are used to co-train the neural network. Some images restored at different temperatures using joint training are shown in Supplementary Figure 12; the images can be restored fairly well from T = 23.5 ℃ to T = 26.5 ℃. Considering that typical indoor working temperatures fall within this range, our system has a degree of practicality. In practical applications, the 1-km MMF and the laser source can be packaged in a thermostatic container, which can further improve the robustness to the environmental temperature.

To investigate the bending effect, we fixed the fiber-end ball and bent the fiber probe into a semicircle, as shown in the inset of Supplementary Figure 11b. We use the bending radius R to quantify the degree of bending. Similarly, 10,000 image/waveform pairs were used to calibrate the system with the bending radius fixed at 28 cm. Then, the fiber was bent so that the radius changed from 28 cm to 12 cm, and we collected test data under 8 different bending states, with 500 data per state; these data were used for testing. The quality of the restored images at the different bending states is shown by the blue line in Supplementary Figure 11b. We demonstrated joint training in the same way: changing the bending radius from 28 cm to 12 cm while collecting 8,000 data at different bending states, 4,000 for training and the other 4,000 for testing. The initial 10,000 data collected at R = 28 cm and the 4,000 training data were combined to co-train the network, which was then used to recover the other 4,000 images. The results are shown as the red line in Supplementary Figure 11b. The influence of bending is much weakened. Some images restored at different states are shown in Supplementary Figure 13; the images can be mostly restored until R = 19 cm.
Supplementary Figure 13. Several recovered images at different bending states after joint training.
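The joint-training data assembly used for both the temperature and bending tests above follows the same pattern: 10,000 pairs collected at the reference condition are pooled with 4,000 pairs spanning the perturbed conditions, and the remaining 4,000 perturbed pairs are held out for testing. A minimal sketch, with the array layout assumed since the excerpt gives only the sample counts:

    import numpy as np

    def assemble_joint_training(fixed_pairs, varied_pairs,
                                n_varied_train=4000, seed=0):
        # fixed_pairs / varied_pairs: (waveforms, images) tuples of
        # aligned arrays (assumed layout, not specified in the excerpt).
        wv_f, im_f = fixed_pairs    # e.g., 10,000 pairs at 25 C or R = 28 cm
        wv_v, im_v = varied_pairs   # e.g., 8,000 pairs across swept conditions
        idx = np.random.default_rng(seed).permutation(len(wv_v))
        tr, te = idx[:n_varied_train], idx[n_varied_train:]
        train = (np.concatenate([wv_f, wv_v[tr]]),
                 np.concatenate([im_f, im_v[tr]]))  # 10,000 + 4,000 co-training
        test = (wv_v[te], im_v[te])                 # 4,000 perturbed, held out
        return train, test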

Supplementary Note 7: Individual Illumination
Like most previous studies, we also tried using a separate source to illuminate the images and an objective to couple light into the fiber probe, as shown in Supplementary Figure 14. The source is collimated by a lens and illuminates the DMD, where images with a size of around 5×5 mm² are displayed. The reflected light is coupled into the 1-km MMF via a 40× objective. The rest of the system is identical to that presented in the main text. Compared with the all-fiber system, this individual-illumination system provides brighter illumination and is suitable for detecting relatively large objects owing to the use of an objective. After the same training and testing, the recovery results show a fidelity of 83.0% and an SSIM of 0.76. The comparison of the recovery performance of the two systems is shown in Supplementary Figure 15a, which indicates that the two illumination methods have similar performance and demonstrates the high adaptability of our proposed method. The recovery of some other types of images with this system was also tested, as shown in Supplementary Figure 15b. Besides, we tested its classification performance; the confusion matrices are shown in Supplementary Figure 16.

The first step is collecting waveform/image pairs for training. To explore the minimum number of pairs required for recovery, we trained the network with different numbers of samples; the results are shown in Supplementary Figure 18a. We see that 10,000 samples are adequate for achieving the optimal performance, so at least 10,000 waveform/image pairs should be collected. The training process with 10,000 samples is shown in Supplementary Figure 18b. The model converged after approximately 15 iterations, which took 4 minutes.
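For reference, below is a minimal PyTorch sketch of the first classification structure from Supplementary Note 5: a CNN fed with the waveform reshaped into a 64×64 map and outputting the ten digit classes. The layer sizes and the use of PyTorch are illustrative assumptions, as the excerpt does not specify the implementation.

    import torch
    import torch.nn as nn

    class WaveformCNN(nn.Module):
        # Assumed architecture: two conv/pool stages and a linear head;
        # only the 64x64 input and the 10 output classes come from the text.
        def __init__(self, num_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),   # 64x64 -> 32x32
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),   # 32x32 -> 16x16
            )
            self.classifier = nn.Linear(32 * 16 * 16, num_classes)

        def forward(self, waveform):
            # waveform: (batch, 4096) temporal samples reshaped to a 2-D map
            x = waveform.view(-1, 1, 64, 64)
            return self.classifier(self.features(x).flatten(1))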